
chore: align sandbox tooling and policies with upstream OpenShell#24

Merged
drew merged 2 commits into main from chore/align-sandbox-tooling-and-policies
Mar 13, 2026

drew (Collaborator) commented Mar 12, 2026

Summary

Aligns the community base sandbox container and network policies with the upstream NVIDIA/OpenShell sandbox that is being removed from deploy/docker/sandbox/. After this change, all tools and policies from the upstream sandbox are present in the community repo.

Base Dockerfile (sandboxes/base/Dockerfile)

  • Added coding agents: Claude CLI (native installer), OpenCode CLI (1.2.18), Codex CLI (0.111.0)
  • Pinned versions for reproducibility: Node.js 22.22.1, npm 11.11.0, uv 0.10.8
  • Added python3.13-dev package (needed for native extensions)
  • Added @hono/node-server@1.19.11 to fix a transitive vulnerability (GHSA-wc8c-qw6v-h7f6)
  • Created a writable /sandbox/.venv overlay with --system-site-packages so sandbox users can pip install packages
  • Set environment variables PATH, VIRTUAL_ENV, and UV_PYTHON_INSTALL_DIR in both the Dockerfile ENV and .bashrc
  • Baked in the GitHub skill (skills/github/SKILL.md), a REST-only gh CLI usage guide
  • Created /sandbox/.claude/skills/ with symlinks into .agents/skills/ for agent discovery
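The writable overlay venv can be illustrated with Python's standard-library `venv` module. This is a sketch, not the image's actual build step (the Dockerfile, which per the second commit uses uv, is not reproduced here); `create_overlay_venv` is a hypothetical helper and the temporary directory stands in for `/sandbox/.venv`.

```python
import pathlib
import tempfile
import venv


def create_overlay_venv(env_dir: pathlib.Path) -> pathlib.Path:
    """Create a venv layered over the base interpreter and return its pyvenv.cfg path.

    With system_site_packages=True the base image's packages stay importable,
    while later installs go into the venv's own (writable) site-packages.
    """
    # with_pip=False keeps this sketch fast and offline.
    venv.create(env_dir, system_site_packages=True, with_pip=False)
    return env_dir / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for /sandbox/.venv.
    cfg_text = create_overlay_venv(pathlib.Path(tmp) / ".venv").read_text()

print("include-system-site-packages = true" in cfg_text)  # True
```

The design choice is that installs land under the venv directory, so the base interpreter's own site-packages never needs to be writable by the sandbox user.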

Network Policies (openclaw/policy.yaml, nemoclaw/policy.yaml)

  • Added 3 missing policies: pypi (pip/uv package installs), cursor (Cursor IDE), opencode (OpenCode CLI)
  • Fixed vscode wildcard endpoints: replaced *.vo.msecnd.net and *.gallerycdn.vsassets.io with the exact hosts az764295.vo.msecnd.net and gallerycdn.vsassets.io, because OPA uses exact host matching and the wildcard entries silently matched nothing
  • Removed hardcoded repo-specific rules — replaced johntmyers/alpha-claw and johntmyers/bravo-claw write rules with generic read-only access matching upstream
  • Renamed policies to match upstream naming: github → github_ssh_over_https, nvidia → nvidia_inference, github_repos → github_rest_api
  • Normalized policy names to use hyphens (e.g., claude_code → claude-code)
  • Kept community-specific extras: gitlab, nvidia_web, cluster_pods, inference
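The vscode fix follows from how the policy compares hostnames. The sketch below models exact matching as plain set membership, which is an assumption about the semantics rather than the project's Rego; under it, a pattern such as `*.vo.msecnd.net` is just a literal string that no real host equals. `host_allowed` is a hypothetical helper.

```python
def host_allowed(host: str, allowed_hosts: set) -> bool:
    """Exact host matching: '*' is an ordinary character, not a wildcard."""
    return host in allowed_hosts


# Endpoint lists before and after the fix, as described in the summary above.
old_endpoints = {"*.vo.msecnd.net", "*.gallerycdn.vsassets.io"}
new_endpoints = {"az764295.vo.msecnd.net", "gallerycdn.vsassets.io"}

for host in ("az764295.vo.msecnd.net", "gallerycdn.vsassets.io"):
    print(host, host_allowed(host, old_endpoints), host_allowed(host, new_endpoints))
# az764295.vo.msecnd.net False True
# gallerycdn.vsassets.io False True
```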

Upstream policy coverage

All 8 upstream network policies are now present: claude_code, github_ssh_over_https, nvidia_inference, github_rest_api, pypi, vscode, cursor, opencode.
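A coverage check like this one reduces to a set difference. In the sketch below the upstream names are copied from the list above, while the community list is illustrative rather than parsed from `openclaw/policy.yaml` or `nemoclaw/policy.yaml`. Because the summary spells names with both `_` and `-`, the hypothetical `canonical` helper treats the two as equivalent.

```python
UPSTREAM_POLICIES = [
    "claude_code", "github_ssh_over_https", "nvidia_inference", "github_rest_api",
    "pypi", "vscode", "cursor", "opencode",
]


def canonical(name: str) -> str:
    """Normalize a policy name so 'claude_code' and 'claude-code' compare equal."""
    return name.replace("_", "-")


def missing_upstream(policy_names: list) -> list:
    """Return upstream policy names that have no counterpart in policy_names."""
    present = {canonical(name) for name in policy_names}
    return [name for name in UPSTREAM_POLICIES if canonical(name) not in present]


# Illustrative community list: upstream names (one hyphenated) plus an extra.
community = [
    "claude-code", "github_ssh_over_https", "nvidia_inference", "github_rest_api",
    "pypi", "vscode", "cursor", "opencode", "gitlab",
]
print(missing_upstream(community))                              # []
print(missing_upstream([n for n in community if n != "pypi"]))  # ['pypi']
```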

drew added 2 commits March 12, 2026 18:46
Add missing coding agents (Claude CLI, OpenCode, Codex), pin versions
for reproducibility (Node.js 22.22.1, npm 11.11.0, uv 0.10.8), create
a writable /sandbox/.venv overlay, set PATH/VIRTUAL_ENV/UV_PYTHON_INSTALL_DIR
env vars, and bake the GitHub REST-only skill into the base image.

Align openclaw and nemoclaw network policies with upstream: add pypi,
cursor, and opencode policies; fix vscode wildcard endpoints that
silently fail with OPA exact-match; replace hardcoded repo-specific
write rules with generic read-only access; normalize policy names to
use hyphens.
…nd clean up policies

- Remove deadsnakes PPA, apt Python packages, and pip bootstrap; let uv
  manage the full Python 3.13 toolchain
- Merge Node.js install + npm upgrade into a single RUN layer
- Merge all npm global installs (vuln fixes + CLI tools) into one call
- Add uv cache clean after python install and venv creation
- Copy base policy.yaml into the image instead of just creating the dir
- Remove duplicate UV_PYTHON_INSTALL_DIR ENV and redundant mkdir
- Update syntax directive from dockerfile:1.4 to dockerfile:1
- Revert nemoclaw/openclaw policies to main and replace repo-specific
  rules (johntmyers/alpha-claw, bravo-claw) with generic placeholders
drew force-pushed the chore/align-sandbox-tooling-and-policies branch from f89cc63 to 0d6d027 on March 13, 2026 01:46
drew merged commit d430717 into main on Mar 13, 2026
5 checks passed
drew added a commit that referenced this pull request Mar 13, 2026
The cluster_pods allowed_ips policy was accidentally removed in #24.
This policy allows sandbox binaries to reach services on the k3s
cluster pod network (10.42.0.0/16), which is required for internal
service communication.
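Whether a destination falls inside the k3s pod network is a CIDR membership test. A minimal sketch with Python's `ipaddress` module, assuming `allowed_ips` is evaluated as plain range membership (the policy syntax itself is not shown here, and `in_cluster_pods` is a hypothetical helper):

```python
import ipaddress

# k3s cluster pod network, as stated in the commit message above.
CLUSTER_PODS = ipaddress.ip_network("10.42.0.0/16")


def in_cluster_pods(ip: str) -> bool:
    """True if ip lies inside the cluster pod CIDR."""
    return ipaddress.ip_address(ip) in CLUSTER_PODS


print(in_cluster_pods("10.42.3.17"))  # True
print(in_cluster_pods("10.43.0.10"))  # False
```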
factory-octavian pushed a commit to factory-octavian/OpenShell-Community that referenced this pull request Apr 1, 2026
…ons (!17)

Closes NVIDIA#24

## Problem

SSH sessions into the sandbox were **not entering the network namespace** or receiving proxy environment variables. This meant every command run via SSH (the only user-facing path) had unrestricted internet access, completely bypassing OPA network policy enforcement.

The root cause: the SSH server was started **before** the network namespace and proxy were created in `lib.rs`, so it never received the netns fd or proxy URL.

Additionally, **gRPC inference from within the sandbox did not work at all** — even after fixing the netns, multiple issues prevented the Python SDK from reaching the navigator server through the CONNECT proxy.

## Changes

### Sandbox binary

**Core fix — reorder startup + thread netns through SSH:**
- `lib.rs`: Move netns + proxy creation before SSH server start. Compute `ssh_netns_fd` and `ssh_proxy_url`, pass them to `run_ssh_server()`.
- `ssh.rs`: Thread `netns_fd` and `proxy_url` through the full SSH call chain into `spawn_pty_shell()`. Set proxy env vars on the shell command. Call `setns(fd, CLONE_NEWNET)` in `install_pre_exec()`.

**Proxy — control plane allowlist + IPv6 socket lookup:**
- `proxy.rs`: Connections to the navigator endpoint (derived from `NAVIGATOR_ENDPOINT`) are always allowed without OPA evaluation, logged with `engine=control_plane`. This is infrastructure the sandbox needs to function, not a user-configurable policy.
- `procfs.rs`: Extended `parse_proc_net_tcp` to also check `/proc/<pid>/net/tcp6`. gRPC C-core uses `AF_INET6` sockets even for IPv4 connections, so its TCP entries were invisible to the proxy's identity resolver. Also fixed port parsing to use `rsplit_once(':')` for correct IPv6 address handling.
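Both `/proc/<pid>/net/tcp` and `tcp6` use the same row layout: the local address is the second column, the socket inode the tenth, and addresses are written as hex followed by `:PORT` (also hex). The Python sketch below mirrors the described change (the real code is Rust in `procfs.rs`, and these function names are hypothetical): it reads both tables, so sockets opened with `AF_INET6` are no longer missed, and takes the port from after the last colon.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def parse_proc_net_line(line: str) -> Optional[Tuple[int, int]]:
    """Return (local_port, socket_inode) for one row of /proc/<pid>/net/tcp or tcp6."""
    fields = line.split()
    # Skip the header row ("sl local_address ...") and rows too short to hold an inode.
    if len(fields) < 10 or fields[0] == "sl":
        return None
    local_address, inode = fields[1], fields[9]
    # Take the port from after the LAST colon, as rsplit_once(':') does in Rust.
    _addr_hex, port_hex = local_address.rsplit(":", 1)
    return int(port_hex, 16), int(inode)


def socket_ports(pid: str = "self") -> Dict[int, int]:
    """Map socket inode -> local port, reading BOTH the IPv4 and IPv6 tables."""
    ports: Dict[int, int] = {}
    for table in ("tcp", "tcp6"):  # tcp6 holds AF_INET6 sockets, e.g. gRPC C-core's
        try:
            text = Path(f"/proc/{pid}/net/{table}").read_text()
        except OSError:  # table missing (IPv6 disabled, non-Linux) or unreadable
            continue
        for line in text.splitlines():
            parsed = parse_proc_net_line(line)
            if parsed is not None:
                port, inode = parsed
                ports[inode] = port
    return ports


row = ("   0: 00000000000000000000000001000000:1F90 00000000000000000000000000000000:0000 "
       "0A 00000000:00000000 00:00000000 00000000  1000        0 12345 1 0000000000000000 100 0 0 10 0")
print(parse_proc_net_line(row))  # (8080, 12345)
```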

**Proxy env vars — lowercase variants for gRPC C-core:**
- `process.rs` + `ssh.rs`: Added lowercase `http_proxy`, `https_proxy`, `grpc_proxy` alongside uppercase. gRPC C-core (libgrpc) checks lowercase first and was ignoring the uppercase-only vars.
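Setting each variable under both spellings is a small loop. In this sketch the proxy URL is hypothetical and `proxy_env` is an illustrative helper, not a function from `process.rs` or `ssh.rs`.

```python
import os
from typing import Dict, Mapping, Optional

PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "GRPC_PROXY")


def proxy_env(proxy_url: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of base (default: os.environ) with proxy vars in both cases."""
    env = dict(os.environ if base is None else base)
    for name in PROXY_VARS:
        env[name] = proxy_url
        # Per the PR, gRPC C-core checks the lowercase spelling first, so set it too.
        env[name.lower()] = proxy_url
    return env


# Hypothetical proxy address; the sandbox computes the real one at startup.
env = proxy_env("http://127.0.0.1:3128", base={})
print(len(env))  # 6
```

The resulting dict can be passed as `env=` to `subprocess.run`, which is roughly the role the shell command's environment plays in the Rust code.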

### Server

- `sandbox/mod.rs`: Added `CAP_SYS_PTRACE` to sandbox pod security context. Required for the proxy (root) to read `/proc/<pid>/fd/` of sandbox-user processes for binary identity resolution.

### Python SDK

- `inference.py`: Strip `http://`/`https://` scheme from endpoint before passing to `grpc.insecure_channel()`, which expects `host:port`. Default `endpoint` and `sandbox_id` from `NAVIGATOR_ENDPOINT` / `NAVIGATOR_SANDBOX_ID` env vars so `Inference()` works inside sandboxes with no arguments.
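The endpoint handling can be sketched as follows. `resolve_inference_target` is a hypothetical helper rather than the SDK's actual code, the environment values are made up, and raising `ValueError` when nothing is configured is an assumed behavior. The returned `host:port` is what would be handed to `grpc.insecure_channel()`.

```python
import os
from typing import Optional, Tuple


def resolve_inference_target(
    endpoint: Optional[str] = None, sandbox_id: Optional[str] = None
) -> Tuple[str, Optional[str]]:
    """Return (host:port, sandbox_id), falling back to the sandbox's env vars."""
    if endpoint is None:
        endpoint = os.environ.get("NAVIGATOR_ENDPOINT")
    if sandbox_id is None:
        sandbox_id = os.environ.get("NAVIGATOR_SANDBOX_ID")
    if not endpoint:
        raise ValueError("no endpoint given and NAVIGATOR_ENDPOINT is unset")
    # grpc.insecure_channel() expects "host:port", so drop an http:// or https:// prefix.
    for scheme in ("http://", "https://"):
        if endpoint.startswith(scheme):
            endpoint = endpoint[len(scheme):]
            break
    return endpoint, sandbox_id


# Hypothetical values standing in for what the sandbox injects.
os.environ["NAVIGATOR_ENDPOINT"] = "https://navigator:8080"
os.environ["NAVIGATOR_SANDBOX_ID"] = "sbx-123"
print(resolve_inference_target())  # ('navigator:8080', 'sbx-123')
```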

### Build / infra

- `ci.toml`: Added `--cap-add=SYS_PTRACE` to `mise run sandbox` to mirror the k8s pod capabilities.

### Documentation

- `architecture/sandbox.md`: Documented `CAP_SYS_PTRACE` requirement and the full set of proxy env vars (uppercase + lowercase).

## Testing

All 29 sandbox unit tests pass. `mise run pre-commit` passes (fmt, clippy, all workspace tests, python lint).

E2E verified on a live cluster:

| Test | Result |
|------|--------|
| Proxy env vars (6 vars, upper+lowercase) | PASS |
| Blocked endpoints (google, anthropic via curl) | PASS — all denied |
| **gRPC inference from SSH session** (`Inference()` with no args, env var defaults) | **PASS** |
| Proxy log: navigator requests show `engine=control_plane` | PASS |
| Proxy log: blocked requests show `engine=opa` with correct deny reasons | PASS |
factory-octavian pushed a commit to factory-octavian/OpenShell-Community that referenced this pull request Apr 1, 2026
…tion (#145)

* fix(server): add field-level size limits to sandbox and provider creation

Closes NVIDIA#24

Add validate_sandbox_spec and provider field validation with named
constants. Configure explicit 1MB tonic max_decoding_message_size.
Inference routes excluded per #133 rearchitecture.

* chore: remove issue number references from code comments

---------

Co-authored-by: John Myers <johntmyers@users.noreply.github.com>
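Field-level validation with named constants might look like this Python sketch. It is not the server's code, which is Rust and sits alongside tonic. The spec fields and every limit value are hypothetical, and only the name `validate_sandbox_spec` comes from the commit message. Checking each field separately lets the server report exactly which field is too large, while the 1MB decoding limit caps the message as a whole.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical limits for illustration; the project's constants may differ.
MAX_NAME_LEN = 253
MAX_IMAGE_LEN = 1024
MAX_ENV_VARS = 128
MAX_ENV_VALUE_LEN = 32 * 1024


@dataclass
class SandboxSpec:
    """Hypothetical subset of a sandbox spec; not the real protobuf message."""
    name: str
    image: str
    env: Dict[str, str] = field(default_factory=dict)


def validate_sandbox_spec(spec: SandboxSpec) -> List[str]:
    """Return one message per oversized field (an empty list means valid)."""
    errors: List[str] = []
    if len(spec.name) > MAX_NAME_LEN:
        errors.append(f"name exceeds {MAX_NAME_LEN} characters")
    if len(spec.image) > MAX_IMAGE_LEN:
        errors.append(f"image exceeds {MAX_IMAGE_LEN} characters")
    if len(spec.env) > MAX_ENV_VARS:
        errors.append(f"env has more than {MAX_ENV_VARS} entries")
    for key, value in spec.env.items():
        if len(value) > MAX_ENV_VALUE_LEN:
            errors.append(f"env var {key!r} exceeds {MAX_ENV_VALUE_LEN} characters")
    return errors


print(validate_sandbox_spec(SandboxSpec(name="demo", image="ubuntu")))     # []
print(validate_sandbox_spec(SandboxSpec(name="a" * 300, image="ubuntu")))  # ['name exceeds 253 characters']
```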